
Use shared statistics merge for union stats #21430

Merged
kumarUjjawal merged 7 commits into apache:main from kumarUjjawal:feat/consolidated_satistics
Apr 30, 2026

Conversation

@kumarUjjawal
Contributor

Which issue does this PR close?

Rationale for this change

DataFusion already has shared logic for merging Statistics, but UnionExec and InterleaveExec still used their own local merge code.

That left a duplicated code path in the codebase and kept the behavior less consistent with the other statistics aggregation paths.

What changes are included in this PR?

  • Reuse Statistics::try_merge_iter for UnionExec statistics merging
  • Reuse the same shared path for InterleaveExec statistics merging
  • Remove the local union-specific statistics merge helpers
  • Add tests for union and interleave statistics merging
  • Add a test for interleave partition-level statistics merging
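To make the consolidation concrete, here is a minimal, self-contained sketch of the merge behavior these bullets describe. The `Stats`/`ColStats` types and this `try_merge_iter` are simplified stand-ins invented for illustration; they are not DataFusion's actual `Statistics::try_merge_iter` API:

```rust
// Hypothetical, simplified stand-ins for DataFusion's statistics types,
// illustrating the shared-merge idea behind `Statistics::try_merge_iter`.
#[derive(Clone, Debug, PartialEq)]
struct ColStats {
    min: Option<i64>,
    max: Option<i64>,
}

#[derive(Clone, Debug, PartialEq)]
struct Stats {
    num_rows: usize,
    cols: Vec<ColStats>,
}

// Merge an iterator of per-input statistics into one summary, the way a
// union combines its children: row counts add, value ranges widen.
fn try_merge_iter<I: IntoIterator<Item = Stats>>(inputs: I) -> Option<Stats> {
    let mut iter = inputs.into_iter();
    let mut acc = iter.next()?;
    for s in iter {
        if s.cols.len() != acc.cols.len() {
            return None; // schemas must line up column-for-column
        }
        acc.num_rows = acc.num_rows.saturating_add(s.num_rows);
        for (a, b) in acc.cols.iter_mut().zip(&s.cols) {
            // An absent stat on either side makes the merged stat absent.
            a.min = a.min.zip(b.min).map(|(x, y)| x.min(y));
            a.max = a.max.zip(b.max).map(|(x, y)| x.max(y));
        }
    }
    Some(acc)
}

fn main() {
    let left = Stats { num_rows: 10, cols: vec![ColStats { min: Some(0), max: Some(5) }] };
    let right = Stats { num_rows: 7, cols: vec![ColStats { min: Some(3), max: Some(9) }] };
    let merged = try_merge_iter([left, right]).unwrap();
    assert_eq!(merged.num_rows, 17);
    assert_eq!(merged.cols[0], ColStats { min: Some(0), max: Some(9) });
    println!("{merged:?}");
}
```

The point of the refactor is that union, interleave, and any other multi-input operator can fold their children through one shared function like this instead of each carrying a private copy of the loop.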

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions bot added the physical-plan (Changes to the physical-plan crate) label Apr 7, 2026
@kumarUjjawal
Contributor Author

What's up with the failing Avro test? I don't think it is related to this PR.

}

#[test]
fn test_union_distinct_count() {
Member


There is a significant concern with the test coverage reduction.

The old test_union_distinct_count contained 15 carefully constructed edge cases for NDV estimation: disjoint ranges, identical ranges, partial overlap, containment, constant values, absent values, and mixed precision types.

These tests validated the correctness of estimate_ndv_with_overlap as used in the union context. The new tests use a single scenario (non-overlapping ranges, single column) and rely on the assumption that try_merge_iter is already well-tested elsewhere.
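For readers unfamiliar with overlap-aware NDV estimation, here is a rough, hypothetical illustration of the idea (this is not DataFusion's actual `estimate_ndv_with_overlap` formula): when two inputs' value ranges are disjoint their distinct counts add, when the ranges coincide the larger count dominates, and partial overlap lands in between.

```rust
// Hypothetical illustration only: combine two NDVs weighted by the
// fraction of the merged range that the two input ranges share.
// Disjoint ranges sum; identical ranges take the max; partial overlap
// interpolates between those two extremes.
fn estimate_ndv(l_ndv: f64, r_ndv: f64, l_rng: (f64, f64), r_rng: (f64, f64)) -> f64 {
    let overlap = (l_rng.1.min(r_rng.1) - l_rng.0.max(r_rng.0)).max(0.0);
    let span = (l_rng.1.max(r_rng.1) - l_rng.0.min(r_rng.0)).max(f64::EPSILON);
    let f = overlap / span; // 0.0 = disjoint, 1.0 = identical ranges
    (1.0 - f) * (l_ndv + r_ndv) + f * l_ndv.max(r_ndv)
}

fn main() {
    // Disjoint ranges: values cannot collide, so NDVs add.
    assert_eq!(estimate_ndv(10.0, 20.0, (0.0, 5.0), (5.0, 10.0)), 30.0);
    // Identical ranges: heavy overlap, keep the larger NDV.
    assert_eq!(estimate_ndv(10.0, 20.0, (0.0, 5.0), (0.0, 5.0)), 20.0);
}
```

Each of the edge cases the old test enumerated (disjoint, identical, partial overlap, containment) exercises a distinct branch of logic like this, which is why the reviewer considers a single-scenario replacement a coverage regression.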

Contributor Author


I have plans for the tests in the shared statistics; I am working on adding those.

Member


nice, I converted the PR to draft now

Member


I think there is still some regression in terms of test coverage:
the old test_stats_union covered multi-column merging with mixed types (Int64, Utf8, Float32) and mixed absent/present stats across columns. The new tests use a single UInt32 column with all stats present.

Could you add a multi-column test case (e.g., 2-3 columns with different types, some with absent stats) to close the gap?

Comment thread testing
Member


is it expected?

Contributor Author


This should not have happened, my mistake

@xudong963 xudong963 marked this pull request as draft April 7, 2026 07:14
@github-actions bot added the common (Related to common crate) label Apr 7, 2026
@kumarUjjawal kumarUjjawal marked this pull request as ready for review April 7, 2026 08:40
@kumarUjjawal kumarUjjawal requested a review from xudong963 April 7, 2026 08:50
@kumarUjjawal
Contributor Author

@xudong963 Can I get a look at this?

Member

@xudong963 xudong963 left a comment


I noticed the NDV fallback change from sum to max for bound-less inputs is a silent accuracy regression, wdyt

@kumarUjjawal
Contributor Author

I noticed the NDV fallback change from sum to max for bound-less inputs is a silent accuracy regression, wdyt

Yeah, I changed it.

Member

@xudong963 xudong963 left a comment


cc @asolimando in case you have a chance to have a look at this.

Member

@asolimando asolimando left a comment


Nice cleanup consolidating the union/interleave stats merging onto a single function @kumarUjjawal.

There is still a little gap in test coverage and I think it would be interesting to keep a customizable fallback for NDV merging so the change becomes a pure refactoring with no semantic changes.

Thanks @xudong963 for the ping!

Comment thread datafusion/common/src/stats.rs Outdated

    (Some(&l), Some(&r)) => Precision::Inexact(
        estimate_ndv_with_overlap(col_stats, item_cs, l, r)
-           .unwrap_or_else(|| usize::max(l, r)),
+           .unwrap_or_else(|| l.saturating_add(r)),
Member


The proposed change at this line is a semantic one: the new fallback is sensible for unions (independent streams, so summing NDVs is a good upper bound), but this function is also used to merge statistics for Parquet files (see statistics.rs#L482 and statistics.rs#L528), for which max is the more classic fallback (files from the same table are likely to share common values, so summing NDVs would overshoot in general).

One option would be to have a configurable fallback (e.g., an enum NdvFallback::Max vs NdvFallback::Sum), so the callers can choose based on their own semantics. WDYT?
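The configurable fallback suggested above could look roughly like this. `NdvFallback` and `merge_ndv` are hypothetical names invented for illustration, not DataFusion's actual API:

```rust
// Sketch of the configurable NDV fallback proposed in review.
#[derive(Clone, Copy)]
enum NdvFallback {
    /// Inputs likely share values (e.g. Parquet files of one table).
    Max,
    /// Inputs are independent streams (e.g. union branches).
    Sum,
}

// Combine two distinct-value counts; `overlap_estimate` stands in for the
// result of a bounds-aware estimator, which is preferred when available.
fn merge_ndv(l: usize, r: usize, overlap_estimate: Option<usize>, fallback: NdvFallback) -> usize {
    overlap_estimate.unwrap_or_else(|| match fallback {
        NdvFallback::Max => l.max(r),
        NdvFallback::Sum => l.saturating_add(r),
    })
}

fn main() {
    // Without bounds, union branches sum NDVs (an upper bound for
    // independent inputs) while same-table files keep the max.
    assert_eq!(merge_ndv(100, 40, None, NdvFallback::Sum), 140);
    assert_eq!(merge_ndv(100, 40, None, NdvFallback::Max), 100);
    assert_eq!(merge_ndv(100, 40, Some(110), NdvFallback::Max), 110);
}
```

Letting each caller pick its fallback would keep the union/interleave consolidation a pure refactoring for the Parquet path, as the reviewer suggests.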

Contributor Author


This sounds like an improvement from the current approach. Thanks @asolimando


@github-actions bot added the datasource (Changes to the datasource crate) label Apr 23, 2026
@kumarUjjawal kumarUjjawal force-pushed the feat/consolidated_satistics branch from f8fd060 to 891da8f on April 23, 2026 04:30
Member

@asolimando asolimando left a comment


My pending comments were fully addressed, thanks @kumarUjjawal!

EDIT: I noticed a test failure, but it seems to be a flaky test unrelated to this PR, so I'm keeping my approval. I'm not sure if the guidelines require a green run; if so, you can probably re-trigger with an empty commit? (I don't have permission to re-run selectively, and I guess you don't either.)

@kumarUjjawal
Contributor Author

The CI failure looks like a flaky EXPLAIN ANALYZE expectation.

@asolimando
Member

The CI failure looks like a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

@kumarUjjawal
Contributor Author

The CI failure looks like a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

I was having an issue with this in another PR and pushed a fix there yesterday; hope that's okay: 94d8f7d

Or should I create a new issue to address this?

@asolimando
Member

The CI failure looks like a flaky EXPLAIN ANALYZE expectation.

If you have bandwidth, would you mind filing an issue for this?

I was having an issue with this in another PR and pushed a fix there yesterday; hope that's okay: 94d8f7d

Or should I create a new issue to address this?

No no, that's fine, I missed that and I wanted to make sure this wouldn't slip through the cracks!

@kumarUjjawal
Contributor Author

@xudong963 If this looks good I will merge this now. Can you take a look?

@xudong963
Member

@xudong963 If this looks good I will merge this now. Can you take a look?

good for me

@kumarUjjawal
Contributor Author

Thank you @xudong963 and @asolimando

@kumarUjjawal kumarUjjawal enabled auto-merge April 30, 2026 13:41
@kumarUjjawal kumarUjjawal added this pull request to the merge queue Apr 30, 2026
@github-actions

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning origin/main
    Building datafusion-common v53.1.0 (current)
error: running cargo-doc on crate 'datafusion-common' failed with output:
-----
   [~160 lines of dependency Compiling/Checking output trimmed]
    Checking parquet v58.1.0
error[E0432]: unresolved import `object_store::buffered`
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:25:19
    |
 25 | use object_store::buffered::BufWriter;
    |                   ^^^^^^^^ could not find `buffered` in `object_store`
    |
note: found an item that was configured out
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/object_store-0.13.2/src/lib.rs:545:9
    |
544 | #[cfg(feature = "tokio")]
    |       ----------------- the item is gated behind the `tokio` feature
545 | pub mod buffered;
    |         ^^^^^^^^

error[E0282]: type annotations needed
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:98:13
    |
 98 | /             self.w
 99 | |                 .put(bs)
100 | |                 .await
    | |______________________^ cannot infer type

error[E0282]: type annotations needed
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:107:13
    |
107 | /             self.w
108 | |                 .shutdown()
109 | |                 .await
    | |______________________^ cannot infer type

Some errors have detailed explanations: E0282, E0432.
For more information about an error, try `rustc --explain E0282`.
error: could not compile `parquet` (lib) due to 3 previous errors
warning: build failed, waiting for other jobs to finish...

-----

error: failed to build rustdoc for crate datafusion-common v53.1.0
note: this is usually due to a compilation error in the crate,
      and is unlikely to be a bug in cargo-semver-checks
note: the following command can be used to reproduce the error:
      cargo new --lib example &&
          cd example &&
          echo '[workspace]' >> Cargo.toml &&
          cargo add --path /home/runner/work/datafusion/datafusion/datafusion/common --features backtrace,force_hash_collisions,object_store,parquet,parquet_encryption,recursive_protection,sql,sqlparser &&
          cargo check &&
          cargo doc

    Building datafusion-datasource v53.1.0 (current)
       Built [  36.161s] (current)
     Parsing datafusion-datasource v53.1.0 (current)
      Parsed [   0.030s] (current)
    Building datafusion-datasource v53.1.0 (baseline)
       Built [  35.749s] (baseline)
     Parsing datafusion-datasource v53.1.0 (baseline)
      Parsed [   0.029s] (baseline)
    Checking datafusion-datasource v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.225s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  73.429s] datafusion-datasource
    Building datafusion-physical-plan v53.1.0 (current)
       Built [  32.004s] (current)
     Parsing datafusion-physical-plan v53.1.0 (current)
      Parsed [   0.119s] (current)
    Building datafusion-physical-plan v53.1.0 (baseline)
       Built [  31.867s] (baseline)
     Parsing datafusion-physical-plan v53.1.0 (baseline)
      Parsed [   0.122s] (baseline)
    Checking datafusion-physical-plan v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.649s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  65.956s] datafusion-physical-plan
error: aborting due to failure to build rustdoc for crate datafusion-common v53.1.0

Merged via the queue into apache:main with commit d09a919 Apr 30, 2026
39 checks passed
@kumarUjjawal kumarUjjawal deleted the feat/consolidated_satistics branch April 30, 2026 13:49